DFacTo: Distributed Factorization of Tensors
We present a technique for significantly speeding up Alternating Least Squares (ALS) and Gradient Descent (GD), two widely used algorithms for tensor factorization. By exploiting properties of the Khatri-Rao product, we show how to efficiently address a computationally challenging sub-step of both algorithms. Our algorithm, DFacTo, requires only two sparse matrix-vector products and is easy to parallelize. DFacTo is not only scalable but also on average 4 to 10 times faster than competing algorithms on a variety of datasets. For instance, on a 6.5 million × 2.5 million × 1.5 million dimensional tensor with 1.2 billion non-zero entries, DFacTo takes only 480 seconds on 4 machines to perform one iteration of ALS and 1,143 seconds to perform one iteration of GD.
- Africa > Senegal > Kolda Region > Kolda (0.05)
- North America > United States > Indiana > Tippecanoe County > West Lafayette (0.04)
- North America > United States > Indiana > Tippecanoe County > Lafayette (0.04)
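The computationally challenging sub-step the abstract alludes to is the product of an unfolded tensor with a Khatri-Rao factor, which appears in every ALS iteration. Below is a minimal NumPy sketch of the naive computation that DFacTo reorganizes; the shapes and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker (Khatri-Rao) product of B (J x R) and C (K x R)."""
    J, R = B.shape
    K, _ = C.shape
    # column r is kron(B[:, r], C[:, r]), giving a (J*K) x R matrix
    return np.einsum('jr,kr->jkr', B, C).reshape(J * K, R)

rng = np.random.default_rng(0)
I, J, K, R = 4, 3, 2, 2
X = rng.random((I, J, K))       # toy dense 3-way tensor
X1 = X.reshape(I, J * K)        # mode-1 unfolding (row-major layout)

B = rng.random((J, R))
C = rng.random((K, R))

# the ALS bottleneck: M = X1 @ (B ⊙ C). The naive version materializes the
# large (J*K) x R Khatri-Rao factor; DFacTo's contribution is computing each
# column with two sparse matrix-vector products instead.
M = X1 @ khatri_rao(B, C)
print(M.shape)  # (4, 2)
```

The naive form is memory-bound for large sparse tensors, since the Khatri-Rao factor has J·K rows; avoiding its explicit construction is what makes the sparse matrix-vector reformulation attractive.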
Review for NeurIPS paper: HM-ANN: Efficient Billion-Point Nearest Neighbor Search on Heterogeneous Memory
The paper makes an inaccurate claim about the availability of billion-scale ANNS solutions: it states that the proposed HM-ANN is the first billion-scale ANNS solution on a single machine that does not use compression (see the last paragraph of the Introduction). The performance gain of HM-ANN also seems marginal once its learning curve in practice is taken into account. Finally, the experiments do not evaluate data-fetching performance, so it is hard to conclude that HM-ANN achieves better utilization of heterogeneous memory (HM).
Distributed Learning: A Primer. Behind the algorithms that make Machine…
Distributed learning is one of the most critical components in the ML stack of modern tech companies: by parallelizing over a large number of machines, one can train bigger models on more data faster, unlocking higher-quality production models with more rapid iteration cycles. But don't just take my word for it. Using customized distributed training […] allows us to iterate faster and train models on more and fresher data. Our experiments show that our new large-scale training methods can use a cluster of machines to train even modestly sized deep networks significantly faster than a GPU, and without the GPU's limitation on the maximum size of the model. We set out to implement a large-scale Neural Network training system that leveraged both the advantages of GPUs and the AWS cloud.
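To make the parallelization idea concrete, here is a toy data-parallel gradient step in plain NumPy — a hypothetical setup, not any of the quoted companies' systems. Each "worker" computes the gradient of a least-squares loss on its data shard, and the shard gradients are averaged, mimicking an all-reduce.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((8, 3))   # full batch of 8 examples
y = rng.random(8)
w = np.zeros(3)

def grad(X_shard, y_shard, w):
    # gradient of 0.5 * mean((X w - y)^2) over the shard
    return X_shard.T @ (X_shard @ w - y_shard) / len(y_shard)

shards = np.array_split(np.arange(8), 4)    # 4 workers, 2 rows each
g_workers = [grad(X[idx], y[idx], w) for idx in shards]
g_avg = np.mean(g_workers, axis=0)          # the "all-reduce" step

# with equal shard sizes, the averaged gradient matches the one
# computed on the full batch by a single machine
print(np.allclose(g_avg, grad(X, y, w)))  # True
```

Because the averaged gradient is mathematically identical to the single-machine one, data parallelism changes the wall-clock cost, not the optimization trajectory (ignoring staleness and communication effects).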
Minimalistic Predictions to Schedule Jobs with Online Precedence Constraints
Lassota, Alexandra, Lindermayr, Alexander, Megow, Nicole, Schlöter, Jens
We consider non-clairvoyant scheduling with online precedence constraints, where an algorithm is oblivious to any job dependencies and learns about a job only once all of its predecessors have been completed. Given strong impossibility results in classical competitive analysis, we investigate the problem in a learning-augmented setting, where an algorithm has access to predictions without any quality guarantee. We discuss different prediction models: novel problem-specific models as well as general ones that have been proposed in previous works. We present lower bounds and algorithmic upper bounds for different precedence topologies, and thereby give a structured overview of which kinds of additional (possibly erroneous) information help in designing better algorithms. Along the way, we also improve the known bounds on classical competitive ratios for existing algorithms.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Switzerland (0.04)
- Europe > Germany > Bremen > Bremen (0.04)
- (2 more...)
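The non-clairvoyant model described above can be sketched in a few lines; the jobs, processing times, and the simple round-robin rule here are illustrative and not the paper's algorithms. Processing times are hidden from the scheduler, and a job becomes visible only once all of its predecessors are done.

```python
# hidden processing times (the scheduler never reads these directly)
proc = {'a': 2, 'b': 1, 'c': 3}
# online precedence constraints: 'b' is revealed only after 'a' completes
preds = {'a': [], 'b': ['a'], 'c': []}

done, remaining, t = set(), dict(proc), 0
completion = {}
while len(done) < len(proc):
    # jobs currently revealed to the algorithm: all predecessors completed
    visible = [j for j in remaining if all(p in done for p in preds[j])]
    # non-clairvoyant round-robin: run each visible job for one time unit
    for j in visible:
        t += 1
        remaining[j] -= 1
        if remaining[j] == 0:
            done.add(j)
            completion[j] = t
            del remaining[j]

print(completion)  # {'a': 3, 'b': 5, 'c': 6}
```

Round-robin hedges against the unknown processing times; a learning-augmented algorithm would instead bias processing toward jobs that (possibly erroneous) predictions claim are short or unlock many successors.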
Run.AI raises $13M for its distributed machine learning platform
Tel Aviv's Run.AI, a startup building a new virtualization and acceleration platform for deep learning, is coming out of stealth today. As part of this announcement, the company also said it has now raised a total of $13 million: a $3 million seed round from TLV Partners and a $10 million Series A round led by Haim Sadger's S Capital and TLV Partners. It's no secret that building deep learning models takes a hefty amount of GPU power or access to specialized AI chips. Run.AI argues that the virtualization layers that worked so well in the past don't quite cut it for training today's AI models.
CPU- and GPU-based Distributed Sampling in Dirichlet Process Mixtures for Large-scale Analysis
Dinari, Or, Zamir, Raz, Fisher, John W. III, Freifeld, Oren
In unsupervised learning, Bayesian Nonparametric (BNP) mixture models, exemplified by the Dirichlet-Process Mixture Model (DPMM), provide a principled approach for Bayesian modeling while adapting the model complexity to the data. This contrasts with finite mixture models, whose complexity is determined manually or via model-selection methods. To fix ideas, an important DPMM example is the Dirichlet-Process Gaussian Mixture Model (DPGMM), a Bayesian ∞-dimensional extension of the classical Gaussian Mixture Model (GMM). Despite their potential, however, and although researchers have used them successfully in numerous applications during the last two decades, DPMMs still do not enjoy wide popularity among practitioners, largely due to computational bottlenecks in current algorithms and/or implementations. In particular, one of the missing pieces is the availability of software tools that: 1) can efficiently handle DPMM inference in large datasets; 2) are user-friendly and can easily be modified. We argue that for DPMMs to become a practical choice for large-scale data analysis, implementations of DPMM inference must leverage parallel- and distributed-computing resources (by analogy, consider how advances in GPU computing and GPU software contributed to the success of deep learning). This is because of not only potential speedups but also memory and storage considerations. It is especially true, for example, in distributed mobile robotic sensing applications, where multiple autonomous agents working together have limited computational and communication resources. As another motivating example, consider unsupervised data-analysis tasks in large and high-dimensional computer-vision datasets.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East > Israel > Southern District > Beer-Sheva (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.86)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.66)
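One way to see the "complexity adapts to the data" property the abstract emphasizes is through the Chinese restaurant process (CRP), the partition distribution induced by the Dirichlet process. The sketch below uses an illustrative concentration parameter and sample size; it is a toy prior draw, not the inference algorithm from this work.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 1.0                  # concentration parameter (illustrative value)
counts = []                  # current cluster ("table") sizes
assignments = []
for n in range(100):
    # a new point joins cluster k with prob ∝ counts[k],
    # or opens a new cluster with prob ∝ alpha
    probs = np.array(counts + [alpha], dtype=float)
    probs /= probs.sum()
    k = rng.choice(len(probs), p=probs)
    if k == len(counts):
        counts.append(1)     # a new cluster is born
    else:
        counts[k] += 1
    assignments.append(k)

# the number of clusters was never fixed in advance; it grew with the data
print(len(counts))
```

Under this prior the expected number of clusters grows roughly as alpha·log(n), which is exactly the contrast with finite mixtures that the abstract draws.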
A Guide to Parallel and Distributed Deep Learning for Beginners
In recent years, we have witnessed the success of deep learning across multiple domains. But we have also seen that, due to the large size and computational complexity of modern models and datasets, training can become prohibitively slow. To improve the performance of these models, parallel and distributed deep learning approaches have been introduced. In this article, we discuss parallel and distributed deep learning methods in detail and try to understand how they help in speeding up the deep learning process. The major points to be discussed in this article are listed below.
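As a concrete preview of one family of methods such guides cover, here is a toy model-parallel (tensor-parallel) layer split in plain NumPy; no real devices are involved, and the names and shapes are illustrative. The layer's weight matrix is split column-wise across two "devices", each computes its half of the output, and the partial results are concatenated.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((4, 6))   # a batch of 4 inputs
W = rng.random((6, 8))   # one layer's weight matrix

# model (tensor) parallelism: split W column-wise, let each "device"
# compute its slice of the output, then concatenate the partial outputs
W0, W1 = W[:, :4], W[:, 4:]
out = np.concatenate([x @ W0, x @ W1], axis=1)

print(np.allclose(out, x @ W))  # True
```

This is the complement of data parallelism: instead of replicating the model and splitting the batch, the model itself is split, which is what removes a single accelerator's memory ceiling on model size.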